Cocojunk

🚀 Dive deep with CocoJunk – your destination for detailed, well-researched articles across science, technology, culture, and more. Explore knowledge that matters, explained in plain English.


Pseudorandom number generator

Published: May 3, 2025, 19:23 UTC



The Forbidden Code: Understanding Pseudorandom Number Generators (PRNGs)

Welcome to a dive into one of the most fundamental yet often misunderstood tools in programming: the Pseudorandom Number Generator (PRNG). While standard programming classes might teach you to call a rand() function and assume you get something "random," the reality under the hood is far more complex, deterministic, and, frankly, filled with traps for the unwary. Understanding how PRNGs really work – their strengths, their weaknesses, and their sometimes-hidden predictable nature – is essential knowledge for anyone venturing beyond basic scripting, particularly in fields like simulation, gaming, and, most critically, security. This is the kind of insight often left out, the "forbidden code" you need to master.

What is a Pseudorandom Number Generator (PRNG)?

At its core, a PRNG is an algorithm. It's a set of mathematical instructions that produce a sequence of numbers. The goal is for this sequence to look like and behave like a sequence of truly random numbers.

Pseudorandom Number Generator (PRNG): An algorithm designed to generate a sequence of numbers that approximates the statistical properties of sequences of truly random numbers. It's also sometimes called a Deterministic Random Bit Generator (DRBG) because its output is entirely determined by its initial state.

The critical word here is "pseudorandom." Unlike truly random numbers (which might come from physical processes like atmospheric noise or radioactive decay, generated by hardware random number generators), the sequence produced by a PRNG is completely determined by an initial value.

Seed: The initial value used to start a pseudorandom number generator. If you start a PRNG with the same seed, it will produce the exact same sequence of numbers every single time.

This deterministic nature is the fundamental difference between "pseudo" random and "true" random.
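This determinism is easy to see with Python's standard random module: seeding twice with the same value reproduces the sequence exactly.

```python
import random

# Seeding with the same value yields the exact same sequence every time.
random.seed(42)
first_run = [random.randint(0, 99) for _ in range(5)]

random.seed(42)
second_run = [random.randint(0, 99) for _ in range(5)]

print(first_run == second_run)  # True: identical seeds, identical sequences
```

This is precisely the property that makes PRNGs reproducible for debugging, and dangerous when the seed is guessable.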

PRNGs vs. Hardware Random Number Generators (HRNGs)

While HRNGs (like those based on thermal noise or quantum phenomena) can produce sequences closer to true randomness, PRNGs are vastly more common in software due to crucial advantages:

  1. Speed: PRNGs are typically very fast, generating numbers algorithmically using simple computations. HRNGs often rely on slower physical processes and hardware interaction.
  2. Reproducibility: Because a PRNG's sequence is determined by its seed, you can regenerate the exact same sequence simply by using the same seed. This is impossible with a true HRNG (unless you record the output).

This reproducibility, while disqualifying them as truly random, is incredibly valuable in many applications.

Why Use PRNGs? Applications They Don't Always Detail

Despite their deterministic nature, PRNGs are indispensable tools across various domains:

  • Simulations:

    • Monte Carlo Methods: These techniques use repeated random sampling to obtain numerical results, often for complex problems that are difficult to solve analytically. Examples include simulating particle physics, financial market behavior, or weather patterns. A PRNG provides the necessary "randomness" to explore the vast possibility space, and its reproducibility is vital for debugging and verifying simulations.
    • Statistical Sampling: When you need to select a random subset of data or generate data that follows a specific probability distribution, PRNGs are used.
  • Electronic Games:

    • Procedural Generation: Creating vast game worlds, unique items, level layouts, or enemy behaviors automatically rather than manually. A PRNG is the engine behind this process. Using the same seed allows developers and players to revisit or share generated content.
    • Game Logic: Determining unpredictable events like dice rolls, card shuffles, critical hits, enemy movements, or loot drops.
  • Cryptography (Requires Special PRNGs):

    • Generating cryptographic keys, nonces (numbers used once in a communication), initialization vectors (IVs), or keystreams for stream ciphers. Crucially, cryptographic applications have extremely strict requirements for PRNGs that standard ones cannot meet.

The Hidden Dangers: Potential Issues They Won't Emphasize

Here's where you step into "The Forbidden Code." Standard library PRNGs (like the old rand() in C or Java's java.util.Random prior to recent updates) often have significant, non-obvious flaws. Relying on them blindly can lead to unreliable simulations, predictable game outcomes, or, worst of all, catastrophic security vulnerabilities.

John von Neumann, one of the pioneers, was famously skeptical of early PRNGs, stating:

"Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin."

While perhaps hyperbolic, his point highlights the fundamental challenge: creating something that behaves randomly using entirely predictable arithmetic.

Common issues found in poorly designed or outdated PRNGs include:

  • Shorter-than-Expected Periods: The period is the length of the sequence before it begins to repeat. A PRNG's period should ideally be extremely long. Some seeds might trigger cycles that are much shorter than the maximum possible period, leading to sequences that repeat prematurely.

    Period: The length of the sequence of numbers produced by a PRNG before the sequence begins to repeat from the beginning. A short period means the sequence is highly predictable over time.

  • Lack of Uniformity of Distribution: For a large set of generated numbers, the distribution of values should be uniform (e.g., each number within the range should appear roughly the same number of times). Flawed PRNGs can produce numbers where some values or ranges occur more often than others.

    Uniform Distribution: A probability distribution where all outcomes are equally likely. For numbers in a range [a, b], any number within that range has the same chance of being generated.

  • Correlation of Successive Values: Numbers in the sequence should be independent of previous numbers. In weak PRNGs, there can be statistical dependencies, meaning knowing a number (or a few numbers) gives you a hint about what the next number will be. This predictability is a major red flag.

    Correlation: A statistical relationship between successive outputs of a PRNG, where one output provides information about the next. A good PRNG should have negligible correlation between outputs.

  • Poor Dimensional Distribution: When you group numbers from the sequence into tuples (e.g., pairs, triplets, etc.) and plot them in multi-dimensional space, a good PRNG's output should appear randomly scattered. Flawed PRNGs often show distinct patterns or lie only on certain hyperplanes. This makes them fail more sophisticated statistical tests and can be a problem in simulations that rely on multi-dimensional sampling.

    Dimensional Distribution: How well sets of outputs from a PRNG fill a multi-dimensional space. Poor dimensional distribution means the points exhibit patterns or clustering when they should be uniformly scattered.

  • Distribution of Distances Between Values: The spacing between occurrences of specific values should also match that of a random sequence. Flawed PRNGs might show non-random patterns in these distances.

Historical Cautionary Tales

The dangers of using flawed PRNGs are not just theoretical. The RANDU generator, used for decades on IBM mainframes, was notoriously bad. Its outputs showed severe correlations and fell on specific planes in 3D space, rendering many scientific simulation results from that era questionable. The fact that its inadequacy went undetected for so long highlights how subtle these flaws can be.

Even widely used modern languages can ship with weak defaults. As noted in the International Encyclopedia of Statistical Science (2010):

"The list of widely used generators that should be discarded is much longer [than the list of good generators]. Do not trust blindly the software vendors. Check the default RNG of your favorite software and be ready to replace it if needed. This last recommendation has been made over and over again over the past 40 years. Perhaps amazingly, it remains as relevant today as it was 40 years ago."

A prime example is Java's java.util.Random, which is built on a Linear Congruential Generator (LCG) – a type known to be statistically weak. Only with Java 17 did the platform add a family of stronger alternatives via the java.util.random package (java.util.Random itself was left unchanged for compatibility). This demonstrates that even in mainstream software, you cannot trust the default PRNG without understanding its limitations. This is your call to look under the hood.

Higher-quality alternatives like the Mersenne Twister (published 1998) offer much better statistical properties and performance, but even newer, faster, and statistically robust generators exist. The key is knowing that options exist and how to evaluate them.

Peeking Under the Hood: Common PRNG Designs

Understanding the different families of PRNGs helps you recognize their potential strengths and weaknesses.

1. Generators Based on Linear Recurrences (The Traditional Approach)

For a long time, the go-to PRNG algorithms were based on relatively simple linear mathematical operations.

  • Linear Congruential Generators (LCGs): These generators are defined by a recurrence relation: X[n+1] = (a * X[n] + c) mod m.

    Linear Congruential Generator (LCG): A type of PRNG that generates a sequence of pseudorandomized numbers calculated with a linear equation modulo m. The sequence is defined by a seed X[0] and constants 'a', 'c', and 'm'.

    LCGs are easy to implement and fast. However, their linear nature makes them statistically weak, especially when used for simulations requiring higher-dimensional uniformity. A famous remark held that if every scientific paper whose results were in doubt because of flawed LCGs were pulled from library shelves, there would be "a gap on each shelf about as big as your fist" – a measure of just how significant their statistical flaws were.

  • Linear-Feedback Shift Registers (LFSRs): These are based on operations over a finite field, often GF(2) (addition and multiplication modulo 2, which correspond to XOR and AND operations). They shift bits in a register and use feedback from certain positions (determined by a "tap" polynomial) to generate the next bit.

    Linear-Feedback Shift Register (LFSR): A digital shift register that cycles through a sequence of states determined by linear feedback. Outputs can be used as a pseudorandom bit sequence.

    LFSRs can produce very long periods and are fast, often implemented directly in hardware. However, pure LFSR outputs also exhibit statistical weaknesses, particularly in their dimensional distribution.

  • Modern Linear Recurrence Generators: Significant advancements built upon these ideas, like the Mersenne Twister. Published in 1998, it is based on a linear recurrence but with a much larger state space and careful design to avoid the pitfalls of simpler linear methods.

    Mersenne Twister: A widely used PRNG algorithm known for its very long period (2^19937 - 1) and excellent equidistribution properties in many dimensions.

    The Mersenne Twister was a major step up, offering excellent statistical quality for simulations while being reasonably fast. Other modern generators like xorshift (introduced by George Marsaglia in 2003) and the WELL family (developed in 2006) further refine linear recurrence and bitwise operations (like XORs, shifts, and rotations) to achieve high speed and good statistical properties. xorshift generators, especially when combined with non-linear operations, are extremely fast and pass strong statistical tests. The WELL generators improved on a specific weakness of the Mersenne Twister: its slow recovery from states containing mostly zero bits.
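The LCG recurrence given above is simple enough to implement in a few lines. This is a minimal sketch using the well-known "Numerical Recipes" constants; they are illustrative, not a recommendation for serious use.

```python
class LCG:
    """Minimal linear congruential generator: x[n+1] = (a * x[n] + c) mod m.

    Constants are the classic Numerical Recipes parameters (a=1664525,
    c=1013904223, m=2^32), shown here purely for illustration.
    """

    def __init__(self, seed, a=1664525, c=1013904223, m=2**32):
        self.state = seed % m
        self.a, self.c, self.m = a, c, m

    def next(self):
        # Apply the linear recurrence and return the new state.
        self.state = (self.a * self.state + self.c) % self.m
        return self.state

gen = LCG(seed=1)
print([gen.next() % 100 for _ in range(5)])
```

Note the final `% 100`: reducing an LCG's output to a small range like this is exactly the kind of casual use that exposes the low-order-bit weaknesses of many LCG variants.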

2. Counter-Based Generators (The Parallel Revolution)

A newer class of PRNGs, particularly relevant in the age of multi-core processors and GPUs, are counter-based generators.

Counter-Based Random Number Generator (CBRNG): A PRNG where the internal state is primarily an integer counter and a key. The random output for a given index n is directly computed as output = f(n, key), where f is usually a cryptographic or near-cryptographic function.

Instead of updating an internal state iteratively (like X[n+1] = algorithm(X[n])), a CBRNG computes the Nth random number directly based on the index N and a fixed key.

The advantages of this design are significant, especially for parallel computing:

  • Reproducibility: The sequence is reproducible simply by knowing the key and the starting counter value.
  • Massive Parallelism: Each random number is generated independently of others in the sequence. This means different threads or GPU cores can compute non-overlapping parts of the sequence simply by using different ranges of the counter, without needing synchronization or shared state updates. This is a game-changer for Monte Carlo simulations on modern hardware.
  • Jumping Ahead: You can instantly compute the random number at index N without computing the N-1 preceding numbers. This allows for easy "jumping" to any point in the sequence or generating completely independent streams of random numbers (by using different keys or widely separated counter ranges).

Examples include Philox and Threefry, which use functions based on multiplication or reduced-strength block ciphers to mix the counter and key.
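The counter-based idea can be sketched with a toy generator. Real CBRNGs like Philox and Threefry use fast multiplication or ARX rounds; here a cryptographic hash stands in as the mixing function f purely for illustration.

```python
import hashlib

def cbrng(counter: int, key: int) -> int:
    """Toy counter-based generator: output = f(counter, key).

    blake2b is used as the mixing function for illustration only;
    production CBRNGs (Philox, Threefry) use much faster primitives.
    """
    data = counter.to_bytes(8, "little") + key.to_bytes(8, "little")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")

# Any index can be computed independently -- no iteration through the
# preceding N-1 values -- so parallel workers can each take a disjoint
# counter range with no shared state.
print(cbrng(0, key=7), cbrng(1_000_000, key=7))
```

Changing the key yields an entirely different, independent stream, which is how CBRNGs hand out per-thread or per-simulation streams cheaply.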

The Security Minefield: Cryptographically Secure PRNGs (CSPRNGs)

This is where the stakes get highest. For applications like cryptography or security-sensitive simulations, standard PRNGs are utterly insufficient. You need a Cryptographically Secure Pseudorandom Number Generator (CSPRNG).

Cryptographically Secure Pseudorandom Number Generator (CSPRNG): A PRNG with additional properties making its output computationally indistinguishable from true random numbers for an adversary who does not know the seed.

The requirement for a CSPRNG is far stronger than just passing statistical tests. A standard PRNG might pass tests for uniformity and correlation, but if you know a few outputs, you might still be able to predict the next ones. A CSPRNG must be designed so that predicting future outputs from past outputs (or even inferring the seed from outputs) is computationally infeasible, even for an attacker with significant resources.

These properties, often called "forward secrecy" (unpredictability of future output) and "backward secrecy" (unpredictability of past output), are incredibly difficult to prove mathematically. Often, the security rests on the assumption that certain mathematical problems (like factoring large numbers) are hard to solve.

Types of CSPRNGs include:

  • Based on Existing Primitives: Using modes of operation from well-established cryptographic algorithms like stream ciphers or block ciphers (in counter or output feedback modes).
  • Dedicated Designs: Algorithms specifically designed with cryptographic security in mind, such as Yarrow (used in macOS/FreeBSD) and Fortuna. Microsoft's CryptGenRandom is a function implementing a CSPRNG.
  • Combination PRNGs: Attempting to mix outputs from several standard PRNGs to hopefully increase security (though this isn't always guaranteed to produce a CSPRNG).
  • Hardness Assumption Designs: PRNGs built directly on the assumed difficulty of mathematical problems, like Blum Blum Shub (based on quadratic residuosity) or the Micali–Schnorr generator. These often have strong security proofs but are typically much slower than other types.
  • Generic Constructions: Theoretical constructions showing that a CSPRNG can be built from any one-way function, but these are usually too slow for practical use.
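In application code, the right move is almost never to implement a CSPRNG yourself but to call the operating system's. In Python, the standard-library secrets module is the intended interface for exactly the use cases listed above:

```python
import secrets

# Draw from the OS CSPRNG rather than a statistical PRNG like random.
key = secrets.token_bytes(32)    # 256 bits of key material
nonce = secrets.token_hex(12)    # hex-encoded 96-bit nonce (24 hex chars)
roll = secrets.randbelow(6) + 1  # unbiased value in 1..6

print(len(key), len(nonce), 1 <= roll <= 6)
```

Unlike random.seed(), there is deliberately no way to seed secrets: reproducibility is exactly the property a CSPRNG consumer must not have.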

The Dual_EC_DRBG Scandal: A Real-World "Forbidden Code" Example

Perhaps the most infamous example of why you must understand CSPRNGs is the Dual_EC_DRBG generator. This algorithm was standardized by NIST (a respected standards body), but later evidence strongly suggested that the NSA had deliberately inserted an asymmetric backdoor into it. This backdoor would have allowed someone (presumably the NSA, who knew the secret parameters) to predict the generator's output if they knew some initial values, while remaining unpredictable to anyone else. This was a deliberate weakening of a standard cryptographic tool, a true piece of "forbidden code" manipulation, and a stark reminder that you cannot blindly trust even officially sanctioned algorithms without intense public scrutiny.

The security of virtually all cryptographic protocols that rely on randomness (like generating session keys for secure connections) hinges on the assumption that the CSPRNG used is truly unpredictable to an attacker. Designing such algorithms is exceptionally challenging.

Evaluating PRNG Quality: The BSI Criteria

To assess the quality of deterministic random number generators, especially in security contexts, standards bodies provide criteria. The German Federal Office for Information Security (BSI) has well-known criteria:

  • K1: Generated sequences should be highly likely to be different from each other (if seeds are different).
  • K2: The sequence must be statistically indistinguishable from truly random numbers according to specified statistical tests. These tests check for things like:
    • Monobit Test: Roughly equal numbers of 0s and 1s.
    • Poker Test: How frequently specific groups of bits (like 4-bit chunks) appear.
    • Runs Test: The frequency of consecutive sequences of identical bits (runs).
    • Long Runs Test: Checks for excessively long runs of identical bits.
    • Autocorrelation Test: Checks for correlations between bits at different positions in the sequence.

    In essence, K2 ensures the bits look independent and the distribution is uniform at a basic level.
  • K3: It should be computationally impossible for an attacker to predict future outputs or the generator's internal state, even if they know a subsequence of the output. This is crucial for forward secrecy and many cryptographic uses.
  • K4: It should be computationally impossible for an attacker to determine previous outputs or previous internal states, even if they know a subsequence of the output or the generator's current state. This adds backward secrecy, providing even stronger protection against compromise.

For any security-sensitive application, only generators meeting K3 or K4 standards are acceptable. Standard PRNGs typically only aim to pass K1 and K2 (statistical tests), which is why they are unsuitable for crypto.
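To make the K2-style checks concrete, here is a crude monobit test. Real test suites (such as NIST SP 800-22) use proper statistical thresholds; the fixed tolerance here is only illustrative.

```python
import random
from collections import Counter

def monobit_test(bits, tolerance=0.01):
    """Crude monobit check: the fraction of 1s should be close to 0.5.

    The fixed tolerance is an illustrative stand-in for the chi-square
    style thresholds used by real test suites.
    """
    counts = Counter(bits)
    fraction_ones = counts[1] / len(bits)
    return abs(fraction_ones - 0.5) < tolerance

random.seed(0)
sample = [random.getrandbits(1) for _ in range(100_000)]
print(monobit_test(sample))                 # a decent PRNG passes
print(monobit_test([1] * 100_000))          # a stuck generator fails
```

Passing this test says very little on its own: a counter emitting 0, 1, 0, 1, ... passes the monobit test perfectly, which is why K2 requires a whole battery of tests, and why K3/K4 require far more than statistics.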

The Mathematics Behind It: A Glimpse

Formally, a PRNG can be viewed as a function $f: \mathbb{N}_1 \rightarrow \mathbb{R}$ (mapping positive integers, representing the index in the sequence, to real numbers). For $f$ to be a PRNG approximating a target distribution $P$ on the real numbers $\mathbb{R}$, two main conditions must hold:

  1. The output values must fall within the expected range or set $A$ associated with the distribution $P$.
  2. For any statistically testable property (represented by a set $E$ in a collection of sets $\mathfrak{F}$), the proportion of generated numbers falling into $E$ should converge to the probability $P(E)$ predicted by the target distribution as the length of the sequence increases. This is expressed mathematically as: $\forall E \in \mathfrak{F},\ \forall \varepsilon > 0,\ \exists N \in \mathbb{N}_1,\ \forall n \geq N:\ \left| \frac{\#\{i \in \{1, \dots, n\} : f(i) \in E\}}{n} - P(E) \right| < \varepsilon$. In simpler terms: as you generate more and more numbers, the fraction of numbers falling into any specific interval or set should get closer and closer to the theoretical probability of a truly random number from the target distribution falling into that set.

This mathematical definition underpins the statistical tests (like K2) used to evaluate PRNG quality.

Generating Non-Uniform Distributions: Transforming Uniformity

Most PRNG algorithms fundamentally generate numbers that approximate a uniform distribution over a specific range (like numbers between 0 and 1, or integers within a certain bit length). But what if you need numbers that follow a different distribution, like a Gaussian (normal) distribution, an exponential distribution, or a Poisson distribution (often needed for simulating real-world phenomena)?

You can achieve this by taking the output of a good uniform PRNG and applying a mathematical transformation. The most common method is Inverse Transform Sampling.

  1. First, you need the Cumulative Distribution Function (CDF) of your target distribution. The CDF, $F(b)$, gives the probability that a random variable following the distribution will be less than or equal to a value $b$. For continuous distributions, this is the integral of the probability density function up to $b$.
  2. Since the CDF maps values from the distribution to probabilities between 0 and 1 (inclusive), its inverse function, $F^{-1}(c)$, maps a probability $c$ back to a value $b$ from the distribution such that $F(b)=c$.
  3. If you generate a number $c$ from a uniform distribution between 0 and 1, applying the inverse CDF $F^{-1}(c)$ to it will give you a number that follows your target distribution.

Inverse Transform Sampling: A method for generating random numbers from any probability distribution given its cumulative distribution function (CDF), by applying the inverse of the CDF to random numbers drawn from a uniform distribution.

Cumulative Distribution Function (CDF): For a real-valued random variable $X$, the function $F(x)$ is defined for every number $x$ by $F(x) = P(X \le x)$, which is the probability that $X$ will take a value less than or equal to $x$.

For example, to get numbers from a Gaussian distribution, you'd generate uniform numbers $c$ and compute $F^{-1}(c)$, where $F$ is the Gaussian CDF (related to the error function, erf).
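The Gaussian inverse CDF has no simple closed form, but the exponential distribution's does, which makes it a clean worked example of inverse transform sampling: with CDF $F(x) = 1 - e^{-\lambda x}$, the inverse is $F^{-1}(u) = -\ln(1-u)/\lambda$.

```python
import math
import random

def exponential_sample(rate: float, rng=random) -> float:
    """Inverse transform sampling for the exponential distribution.

    CDF:  F(x) = 1 - exp(-rate * x)
    so    F^{-1}(u) = -ln(1 - u) / rate
    """
    u = rng.random()                 # uniform in [0, 1)
    return -math.log(1.0 - u) / rate

random.seed(1)
samples = [exponential_sample(rate=2.0) for _ in range(100_000)]
mean = sum(samples) / len(samples)
print(mean)  # should be close to the theoretical mean 1/rate = 0.5
```

The quality of the result inherits directly from the underlying uniform generator: a biased or correlated uniform source produces a biased or correlated exponential sample, no matter how exact the transform is.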

However, applying this in practice isn't always straightforward or efficient:

  • You need the inverse CDF function, which isn't always available in a simple closed form.
  • Computing the inverse CDF repeatedly can be computationally expensive. Faster methods like the Ziggurat algorithm are often used for common distributions like Gaussian or exponential.
  • Real-world number representations are finite, meaning the infinite "tails" of distributions like the Gaussian must be truncated.

This shows another layer of complexity: even starting with a good uniform PRNG, generating other distributions requires careful implementation to maintain statistical quality and efficiency.

Early Attempts: The Middle-Square Method (A Historic Lesson)

Looking at early PRNGs provides valuable lessons in what not to do and the constraints pioneers faced. John von Neumann's "middle-square" method from 1946 is a classic example.

The algorithm is simple:

  1. Start with a seed number (e.g., 4 digits).
  2. Square the number.
  3. Take the middle digits of the result (e.g., the middle 4 digits if the square is 8 digits) as the next "random" number.
  4. Use this new number as the seed for the next iteration.

Example (with 4 digits, seed 1111):

  1. Seed = 1111
  2. Square = 1111 * 1111 = 1234321 (pad to 8 digits: 01234321)
  3. Middle digits (the "random" number) = 2343
  4. New seed = 2343
  5. Square = 2343 * 2343 = 5489649 (pad to 8 digits: 05489649)
  6. Middle digits = 4896
  7. New seed = 4896
  8. ...and so on.

Von Neumann used 10-digit numbers, but the principle was the same.
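The steps above translate directly into code. A minimal sketch for 4-digit numbers:

```python
def middle_square(seed: int, digits: int = 4) -> int:
    """One step of von Neumann's middle-square method:
    square the seed, zero-pad to 2*digits, keep the middle digits."""
    squared = str(seed ** 2).zfill(2 * digits)
    start = digits // 2
    return int(squared[start:start + digits])

x = 1111
sequence = []
for _ in range(10):
    x = middle_square(x)
    sequence.append(x)
print(sequence)  # starts 2343, 4896, ... then quickly shrinks toward small values
```

Running this for even a few dozen steps from most seeds shows the degeneration described below: the values collapse toward small numbers or fall into short cycles.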

Why did he use such a seemingly simplistic method? Speed and memory were severely limited on early computers like the ENIAC. Generating numbers arithmetically was hundreds of times faster than reading pre-computed random tables from punched cards. Von Neumann prioritized speed and the ability to reproduce sequences for debugging, even knowing the method had flaws. He worried complex "fixes" might introduce worse, harder-to-find errors.

The major problem, as he knew, is that sequences generated by middle-square degenerate quickly. They often enter short cycles (like 6100 -> 2100 -> 4100 -> 8100 -> 6100) or collapse to zero (0500 -> 2500 -> 2500... or 0000 -> 0000...). It utterly fails modern statistical tests.

While the original middle-square method is obsolete on its own, recent research has shown that combining it with other techniques, like a Weyl sequence (repeatedly adding a fixed irrational constant modulo 1, or an odd constant modulo a power of two in integer implementations), can produce generators with good statistical properties and very long periods. This shows that sometimes, even seemingly broken old ideas can contribute to new designs when understood deeply.

Conclusion: The Unseen Engine

Pseudorandom number generators are the invisible engines driving much of modern software, from complex scientific simulations and video game worlds to the very security keeping your data safe. But unlike components with easily verifiable outputs, the quality of a PRNG's output is a subtle, statistical property that requires deep understanding and rigorous testing.

Relying on default, potentially weak, or compromised PRNGs without understanding their limitations is a gamble they won't always warn you about in school. By delving into their deterministic nature, common pitfalls, different architectural designs, and the stringent requirements for cryptographic security, you gain the "forbidden knowledge" needed to choose appropriate tools, identify potential weaknesses, and build more robust and reliable systems. Don't just call rand(); understand what happens when you do.
